Redefining similarity in a thesaurus by using corpora

Author

  • Hiroyuki Shinnou
Abstract

The aim of this paper is to automatically define the similarity between two nouns that are generally used in various domains. With these similarities, we can construct a large, general thesaurus. In applications of natural language processing, it is necessary to appropriately measure the similarity between two nouns. The similarity is usually calculated from a thesaurus. Since a handmade thesaurus is not suitable for machine use and is expensive to compile, automatic construction of a thesaurus has been attempted using corpora (Hindle, 1990). However, a thesaurus constructed in this way does not contain many nouns, and these nouns are determined by the corpus used. In other words, we cannot construct a general thesaurus from a corpus alone. This can be regarded as a data sparseness problem: few nouns appear in the corpus. To overcome data sparseness, methods have been proposed that estimate the distribution of unseen cooccurrences from the distribution of similar words in the seen cooccurrences. Brown et al. proposed a class-based n-gram model, which generalizes the n-gram model, to predict a word from previous words in a text (Brown et al., 1992). They tackled data sparseness by generalizing the word to the class which contains the word. Pereira et al. used basically the same method, but they proposed a soft clustering scheme, in which membership of a word in a class is probabilistic (Pereira et al., 1993). Brown and Pereira provide clustering algorithms that assign words to proper classes, based on their own models. Dagan et al. proposed a similarity-based model in which each word is generalized, not to its own specific class, but to a set of words which are most similar to it (Dagan et al., 1993). Using this model, they successfully predicted which unobserved cooccurrences were more likely than others, and estimated the probability of the cooccurrences (Dagan et al., 1994).
However, because these schemes look for similar words in the corpus, the number of similarities that can be defined is rather small compared with the number of similarities over all noun pairs. The scheme of looking for similar words in the corpus is itself already affected by data sparseness. In this paper, we propose a method, distinct from the above methods, that uses a handmade thesaurus to find similar words. The proposed method avoids data sparseness by estimating undefined similarities from the similarity in the thesaurus and the similarities defined by the corpus. Thus, the obtained similarities are the same in number as the similarities in the thesaurus, and they reflect the particularity of the domain to which the corpus belongs. Using a thesaurus clearly allows similar words to be identified independently of the corpus, and has the advantage that some ambiguities in analyzing the corpus are resolved. We conducted experiments using Bunrui-goi-hyou (Bunrui-goi-hyou, 1994), a Japanese handmade thesaurus, and a corpus consisting of five years of Japanese economic newspaper articles, with about 7.85 M sentences. We evaluate the appropriateness of the obtained similarities.
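The abstract does not spell out the estimation procedure, so the following is only a minimal sketch of one way to read it. The assumption is that a pair with no corpus similarity inherits the mean corpus similarity of the observed pairs that sit at the same thesaurus level. The toy data, the category codes, and the names `THESAURUS`, `CORPUS_SIM`, `thesaurus_level`, and `redefine_similarities` are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy thesaurus: noun -> category code (path from the root).
THESAURUS = {
    "bank": (1, 2, 6),
    "company": (1, 2, 6),
    "stock": (1, 3, 7),
    "bond": (1, 3, 7),
    "river": (5, 2, 1),
}

# Hypothetical corpus-derived similarities, defined only for observed pairs.
CORPUS_SIM = {
    frozenset(("bank", "company")): 0.8,
    frozenset(("stock", "bank")): 0.4,
    frozenset(("stock", "company")): 0.5,
}


def thesaurus_level(a, b):
    """Count the leading category components that two nouns share."""
    level = 0
    for x, y in zip(THESAURUS[a], THESAURUS[b]):
        if x != y:
            break
        level += 1
    return level


def redefine_similarities():
    """Return a similarity for every noun pair in the thesaurus.

    Observed pairs keep their corpus similarity. Unobserved pairs get the
    mean corpus similarity of observed pairs at the same thesaurus level
    (an assumed rule), or None when no observed pair shares that level.
    """
    by_level = defaultdict(list)
    for pair, sim in CORPUS_SIM.items():
        a, b = tuple(pair)
        by_level[thesaurus_level(a, b)].append(sim)
    level_mean = {lvl: sum(v) / len(v) for lvl, v in by_level.items()}

    result = {}
    for a, b in combinations(sorted(THESAURUS), 2):
        pair = frozenset((a, b))
        if pair in CORPUS_SIM:
            result[pair] = CORPUS_SIM[pair]
        else:
            result[pair] = level_mean.get(thesaurus_level(a, b))
    return result


if __name__ == "__main__":
    for pair, sim in redefine_similarities().items():
        print(sorted(pair), sim)
```

In this sketch every pair covered by the thesaurus receives an entry, which mirrors the claim that the obtained similarities are the same in number as those in the thesaurus, while the values themselves come from the corpus.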

Similar resources

Cross-Language High Similarity Search Using a Conceptual Thesaurus

This work addresses the issue of cross-language high similarity and near-duplicates search, where, for the given document, a highly similar one is to be identified from a large cross-language collection of documents. We propose a concept-based similarity model for the problem which is very light in computation and memory. We evaluate the model on three corpora of different nature and two langua...

Evaluation of the Sketch Engine Thesaurus on Analogy Queries

Recent research on vector representation of words in texts brings new methods of evaluating distributional thesauri. One such method is the task of analogy queries. We evaluated the Sketch Engine thesaurus on a subset of analogy queries using several similarity options. We show that Jaccard similarity is better than the cosine one for bigger corpora, it even substantially outperforms the wor...

Analysis and Construction of Noun Hypernym Hierarchies to Enhance Roget’s Thesaurus

Lexical resources are machine-readable dictionaries or lists of words, where semantic relationships between the terms are somehow expressed. These lexical resources have been used for many tasks such as word sense disambiguation and determining semantic similarity between terms. In recent years some research has been put into automatically building lexical resources from large corpora. In this ...

PLSI Utilization for Automatic Thesaurus Construction

When acquiring synonyms from large corpora, it is important to deal not only with such surface information as the context of the words but also their latent semantics. This paper describes how to utilize a latent semantic model PLSI to acquire synonyms automatically from large corpora. PLSI has been shown to achieve a better performance than conventional methods such as tf·idf and LSI, making i...

Building a Cross-lingual Relatedness Thesaurus using a Graph Similarity Measure

The Internet is an ever growing source of information stored in documents of different languages. Hence, cross-lingual resources are needed for more and more NLP applications. This paper presents (i) a graph-based method for creating one such resource and (ii) a resource created using the method, a cross-lingual relatedness thesaurus. Given a word in one language, the thesaurus suggests words i...

Publication date: 1996